HIGH-DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO

By Nicolai Meinshausen and Peter Bühlmann
ETH Zurich 

The pattern of zero entries in the inverse covariance matrix of 
a multivariate normal distribution corresponds to conditional inde- 
pendence restrictions between variables. Covariance selection aims at 
estimating those structural zeros from data. We show that neighbor- 
hood selection with the Lasso is a computationally attractive alter- 
native to standard covariance selection for sparse high-dimensional 
graphs. Neighborhood selection estimates the conditional indepen- 
dence restrictions separately for each node in the graph and is hence 
equivalent to variable selection for Gaussian linear models. We show 
that the proposed neighborhood selection scheme is consistent for 
sparse high-dimensional graphs. Consistency hinges on the choice of 
the penalty parameter. The oracle value for optimal prediction does 
not lead to a consistent neighborhood estimate. Controlling instead 
the probability of falsely joining some distinct connectivity compo- 
nents of the graph, consistent estimation for sparse graphs is achieved 
(with exponential rates), even when the number of variables grows 
as the number of observations raised to an arbitrary power. 

1. Introduction. Consider the $p$-dimensional multivariate normal
distributed random variable

$X = (X_1, \ldots, X_p) \sim \mathcal{N}(\mu, \Sigma).$

This includes Gaussian linear models where, for example, $X_1$ is the response
variable and $\{X_k; 2 \le k \le p\}$ are the predictor variables. Assuming that
the covariance matrix $\Sigma$ is nonsingular, the conditional independence
structure of the distribution can be conveniently represented by a graphical
model $\mathcal{G} = (\Gamma, E)$, where $\Gamma = \{1, \ldots, p\}$ is the set of
nodes and $E$ the set of edges in $\Gamma \times \Gamma$. A pair $(a, b)$ is
contained in the edge set $E$ if and only if $X_a$ is conditionally dependent
on $X_b$, given all remaining variables
$X_{\Gamma \setminus \{a,b\}} = \{X_k; k \in \Gamma \setminus \{a, b\}\}$. Every pair
of variables not contained in the edge set is conditionally independent, given
all remaining variables, and corresponds to a zero entry in the inverse
covariance matrix [12].

Covariance selection was introduced by Dempster [3] and aims at discov- 
ering the conditional independence restrictions (the graph) from a set of 
i.i.d. observations. Covariance selection traditionally relies on the discrete 
optimization of an objective function [5, 12]. Exhaustive search is computa- 
tionally infeasible for all but very low-dimensional models. Usually, greedy 
forward or backward search is employed. In forward search, the initial esti- 
mate of the edge set is the empty set and edges are then added iteratively 
until a suitable stopping criterion is satisfied. The selection (deletion) of a
single edge in this search strategy requires an MLE fit [15] for $O(p^2)$
different models. The procedure is not well suited for high-dimensional graphs.
The existence of the MLE is not guaranteed in general if the number of 
observations is smaller than the number of nodes [1]. More disturbingly, the 
complexity of the procedure renders even greedy search strategies impracti- 
cal for modestly sized graphs. In contrast, neighborhood selection with the 
Lasso, proposed in the following, relies on optimization of a convex function, 
applied consecutively to each node in the graph. The method is computa- 
tionally very efficient and is consistent even for the high-dimensional setting, 
as will be shown. 

Neighborhood selection is a subproblem of covariance selection. The
neighborhood $\mathrm{ne}_a$ of a node $a \in \Gamma$ is the smallest subset of
$\Gamma \setminus \{a\}$ so that, given all variables $X_{\mathrm{ne}_a}$ in the
neighborhood, $X_a$ is conditionally independent of all remaining variables.
The neighborhood of a node $a \in \Gamma$ consists of all nodes
$b \in \Gamma \setminus \{a\}$ so that $(a, b) \in E$. Given $n$ i.i.d.
observations of $X$, neighborhood
selection aims at estimating (individually) the neighborhood of any given 
variable (or node). The neighborhood selection can be cast as a standard 
regression problem and can be solved efficiently with the Lasso [16], as will 
be shown in this paper. 

The consistency of the proposed neighborhood selection will be shown for 
sparse high-dimensional graphs, where the number of variables is potentially 
growing as any power of the number of observations (high-dimensionality), 
whereas the number of neighbors of any variable is growing at most slightly 
slower than the number of observations (sparsity). 

A number of studies have examined the case of regression with a growing 
number of parameters as sample size increases. The closest to our setting 
is the recent work of Greenshtein and Ritov [8], who study consistent pre- 
diction in a triangular setup very similar to ours (see also [10]). However, 
the problem of consistent estimation of the model structure, which is the 
relevant concept for graphical models, is very different and not treated in 
these studies.

We study in Section 2 under which conditions, and at which rate, the
neighborhood estimate with the Lasso converges to the true neighborhood. 
The choice of the penalty is crucial in the high-dimensional setting. The ora- 
cle penalty for optimal prediction turns out to be inconsistent for estimation 
of the true model. This solution might include an unbounded number of noise 
variables in the model. We motivate a different choice of the penalty such 
that the probability of falsely connecting two or more distinct connectivity 
components of the graph is controlled at very low levels. Asymptotically, 
the probability of estimating the correct neighborhood converges exponen- 
tially to 1, even when the number of nodes in the graph is growing rapidly 
as any power of the number of observations. As a consequence, consistent 
estimation of the full edge set in a sparse high-dimensional graph is possible 
(Section 3). 

Encouraging numerical results are provided in Section 4. The proposed 
estimate is shown to be both more accurate than the traditional forward 
selection MLE strategy and computationally much more efficient. The accu- 
racy of the forward selection MLE fit is in particular poor if the number of 
nodes in the graph is comparable to the number of observations. In contrast, 
neighborhood selection with the Lasso is shown to be reasonably accurate 
for estimating graphs with several thousand nodes, using only a few hundred 
observations. 



2. Neighborhood selection. Instead of assuming a fixed true underlying
model, we adopt a more flexible approach similar to the triangular setup
in [8]. Both the number of nodes in the graphs (number of variables), denoted
by $p(n) = |\Gamma(n)|$, and the distribution (the covariance matrix) depend in
general on the number of observations, so that $\Gamma = \Gamma(n)$ and
$\Sigma = \Sigma(n)$. The neighborhood $\mathrm{ne}_a$ of a node $a \in \Gamma(n)$
is the smallest subset of $\Gamma(n) \setminus \{a\}$ so that $X_a$ is
conditionally independent of all remaining variables. Denote the closure of
node $a \in \Gamma(n)$ by $\mathrm{cl}_a := \mathrm{ne}_a \cup \{a\}$. Then

$X_a \perp \{X_k; k \in \Gamma(n) \setminus \mathrm{cl}_a\} \mid X_{\mathrm{ne}_a}.$

For details see [12]. The neighborhood depends in general on n as well. 
However, this dependence is notationally suppressed in the following. 

It is instructive to give a slightly different definition of a neighborhood. For
each node $a \in \Gamma(n)$, consider optimal prediction of $X_a$, given all
remaining variables. Let $\theta^a \in \mathbb{R}^{p(n)}$ be the vector of
coefficients for optimal prediction,

(1)   $\theta^a = \arg\min_{\theta : \theta_a = 0} E\bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\bigr)^2.$

As a generalization of (1), which will be of use later, consider optimal
prediction of $X_a$, given only a subset of variables
$\{X_k; k \in \mathcal{A}\}$, where $\mathcal{A} \subseteq \Gamma(n) \setminus \{a\}$.
The optimal prediction is characterized by the vector $\theta^{a,\mathcal{A}}$,

(2)   $\theta^{a,\mathcal{A}} = \arg\min_{\theta : \theta_k = 0, \forall k \notin \mathcal{A}} E\bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\bigr)^2.$

The elements of $\theta^a$ are determined by the inverse covariance matrix [12].
For $b \in \Gamma(n) \setminus \{a\}$ and $K(n) = \Sigma^{-1}(n)$, it holds that
$\theta^a_b = -K_{ab}(n)/K_{aa}(n)$. The set of nonzero coefficients of $\theta^a$
is identical to the set $\{b \in \Gamma(n) \setminus \{a\} : K_{ab}(n) \ne 0\}$
of nonzero entries in the corresponding row vector of the inverse covariance
matrix and defines precisely the set of neighbors of node $a$. The best
predictor for $X_a$ is thus a linear function of variables in the set of
neighbors of the node $a$ only. The set of neighbors of a node
$a \in \Gamma(n)$ can hence be written as

$\mathrm{ne}_a = \{b \in \Gamma(n) : \theta^a_b \ne 0\}.$

This set corresponds to the set of effective predictor variables in regression
with response variable $X_a$ and predictor variables
$\{X_k; k \in \Gamma(n) \setminus \{a\}\}$. Given $n$ independent observations of
$X \sim \mathcal{N}(0, \Sigma(n))$, neighborhood selection tries to estimate the
set of neighbors of a node $a \in \Gamma(n)$. As the optimal linear prediction of
$X_a$ has nonzero coefficients precisely for variables in the set of neighbors
of the node $a$, it seems reasonable to try to exploit this relation.
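
As a small illustration of the relation $\theta^a_b = -K_{ab}/K_{aa}$, the following Python sketch reads the population neighborhood of a node off a given precision matrix. The 4-node chain matrix `K` and the function name `neighborhood_from_precision` are assumptions made for this example only and are not taken from the paper.

```python
def neighborhood_from_precision(K, a, tol=1e-12):
    """Return (theta^a, ne_a) for node a, given a precision matrix K (list of lists)."""
    p = len(K)
    # Optimal prediction coefficients: theta^a_b = -K_ab / K_aa, with theta^a_a = 0.
    theta = [0.0 if b == a else -K[a][b] / K[a][a] for b in range(p)]
    # The neighborhood is the support of theta^a.
    ne = {b for b in range(p) if abs(theta[b]) > tol}
    return theta, ne


# Hypothetical precision matrix of a chain graph 0 - 1 - 2 - 3.
K = [
    [1.0, 0.4, 0.0, 0.0],
    [0.4, 1.0, 0.4, 0.0],
    [0.0, 0.4, 1.0, 0.4],
    [0.0, 0.0, 0.4, 1.0],
]

theta_1, ne_1 = neighborhood_from_precision(K, 1)
print(theta_1)  # coefficients for predicting X_1 from the other variables
print(ne_1)     # nodes with a nonzero coefficient
```

In practice $K$ is unknown, which is why the neighborhood has to be estimated from data rather than read off in this way.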

2.1. Neighborhood selection with the Lasso. It is well known that the
Lasso, introduced by Tibshirani [16], and known as Basis Pursuit in the
context of wavelet regression [2], has a parsimonious property [11]. When
predicting a variable $X_a$ with all remaining variables
$\{X_k; k \in \Gamma(n) \setminus \{a\}\}$, the vanishing Lasso coefficient
estimates identify asymptotically the neighborhood of node $a$ in the graph,
as shown in the following. Let the $n \times p(n)$-dimensional matrix
$\mathbf{X}$ contain $n$ independent observations of $X$, so that the columns
$\mathbf{X}_a$ correspond for all $a \in \Gamma(n)$ to the vector of $n$
independent observations of $X_a$. Let $\langle \cdot, \cdot \rangle$ be the
usual inner product on $\mathbb{R}^n$ and $\|\cdot\|_2$ the corresponding norm.

The Lasso estimate $\hat\theta^{a,\lambda}$ of $\theta^a$ is given by

(3)   $\hat\theta^{a,\lambda} = \arg\min_{\theta : \theta_a = 0} \bigl(n^{-1} \|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda \|\theta\|_1\bigr),$

where $\|\theta\|_1 = \sum_{b \in \Gamma(n)} |\theta_b|$ is the $\ell_1$-norm of the
coefficient vector. Normalization of all variables to a common empirical
variance is recommended for the estimator in (3). The solution to (3) is not
necessarily unique. However, if uniqueness fails, the set of solutions is still
convex and all our results about neighborhoods (in particular Theorems 1 and 2)
hold for any solution of (3).

Other regression estimates have been proposed, which are based on the
$\ell_p$-norm, where $p$ is typically in the range $[0, 2]$ (see [7]). A value
of $p = 2$ leads to the ridge estimate, while $p = 0$ corresponds to
traditional model selection. It is well known that the estimates have a
parsimonious property (with some components being exactly zero) for $p \le 1$
only, while the optimization problem in (3) is only convex for $p \ge 1$. Hence
$\ell_1$-constrained empirical risk minimization occupies a unique position, as
$p = 1$ is the only value of $p$ for which variable selection takes place while
the optimization problem is still convex and hence feasible for
high-dimensional problems.

The neighborhood estimate (parameterized by $\lambda$) is defined by the nonzero
coefficient estimates of the $\ell_1$-penalized regression,

$\widehat{\mathrm{ne}}_a^\lambda = \{b \in \Gamma(n) : \hat\theta^{a,\lambda}_b \ne 0\}.$

Each choice of a penalty parameter $\lambda$ specifies thus an estimate of the
neighborhood $\mathrm{ne}_a$ of node $a \in \Gamma(n)$ and one is left with the
choice of a suitable penalty parameter. Larger values of the penalty tend to
shrink the size of the estimated set, while more variables are in general
included into $\widehat{\mathrm{ne}}_a^\lambda$ if the value of $\lambda$ is
diminished.
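
The following Python sketch shows one way to compute a solution of (3) and the resulting neighborhood estimate. It uses plain cyclic coordinate descent, which is a swapped-in technique chosen for brevity and not the algorithm used in the paper; the function names and the small deterministic toy design (built from orthogonal $\pm 1$ columns rather than Gaussian samples) are likewise assumptions for illustration. For the objective $n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1$, each coordinate update is a soft-thresholding step with threshold $\lambda/2$.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_neighborhood(X, a, lam, n_sweeps=200, tol=1e-10):
    """Coordinate descent for min_theta n^{-1} ||X_a - X theta||^2 + lam ||theta||_1
    subject to theta_a = 0. X is a list of n rows with p entries each."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    resid = [row[a] for row in X]  # residual X_a - X theta for theta = 0
    for _ in range(n_sweeps):
        max_change = 0.0
        for b in range(p):
            if b == a:
                continue
            col = [row[b] for row in X]
            s = sum(x * x for x in col) / n
            if s == 0.0:
                continue
            # Inner product with the partial residual (contribution of b added back).
            z = sum(x * r for x, r in zip(col, resid)) / n + s * theta[b]
            new = soft_threshold(z, lam / 2.0) / s
            delta = new - theta[b]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, col)]
                theta[b] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    ne_hat = {b for b in range(p) if b != a and theta[b] != 0.0}
    return theta, ne_hat


# Toy design: three mutually orthogonal predictor columns; column 0 is the response.
h1 = [1, -1, 1, -1]
h2 = [1, 1, -1, -1]
h3 = [1, -1, -1, 1]
X = [[2.0 * h1[i] + 0.1 * h2[i], h1[i], h2[i], h3[i]] for i in range(4)]

theta_large, ne_large = lasso_neighborhood(X, a=0, lam=1.0)
theta_small, ne_small = lasso_neighborhood(X, a=0, lam=0.1)
print(ne_large, ne_small)
```

With the larger penalty only the strong predictor survives, while the smaller penalty also admits the weak one, which mirrors the monotone behavior in $\lambda$ described above.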

2.2. The prediction-oracle solution. A seemingly useful choice of the
penalty parameter is the (unavailable) prediction-oracle value,

$\lambda_{\mathrm{oracle}} = \arg\min_{\lambda} E\bigl(X_a - \sum_{k \in \Gamma(n)} \hat\theta^{a,\lambda}_k X_k\bigr)^2.$

The expectation is understood to be with respect to a new $X$, which is
independent of the sample on which $\hat\theta^{a,\lambda}$ is estimated. The
prediction-oracle penalty minimizes the predictive risk among all Lasso
estimates. An estimate of $\lambda_{\mathrm{oracle}}$ is obtained by the
cross-validated choice $\lambda_{\mathrm{cv}}$.

For $\ell_0$-penalized regression it was shown by Shao [14] that the
cross-validated choice of the penalty parameter is consistent for model
selection under certain conditions on the size of the validation set. The
prediction-oracle solution does not lead to consistent model selection for the
Lasso, as shown in the following for a simple example.

Proposition 1. Let the number of variables grow to infinity,
$p(n) \to \infty$, for $n \to \infty$, with $p(n) = o(n^\gamma)$ for some
$\gamma > 0$. Assume that the covariance matrices $\Sigma(n)$ are identical to
the identity matrix except for some pair $(a, b) \in \Gamma(n) \times \Gamma(n)$,
for which $\Sigma_{ab}(n) = \Sigma_{ba}(n) = s$, for some $0 < s < 1$ and all
$n \in \mathbb{N}$. The probability of selecting the wrong neighborhood for
node $a$ converges to 1 under the prediction-oracle penalty,

$P(\widehat{\mathrm{ne}}_a^{\lambda_{\mathrm{oracle}}} \ne \mathrm{ne}_a) \to 1$   for $n \to \infty$.

A proof is given in the Appendix. It follows from the proof of Proposition 1
that many noise variables are included in the neighborhood estimate with the 
prediction-oracle solution. In fact, the probability of including noise variables 
with the prediction-oracle solution does not even vanish asymptotically for a 
fixed number of variables. If the penalty is chosen larger than the prediction- 
optimal value, consistent neighborhood selection is possible with the Lasso, 
as demonstrated in the following. 

2.3. Assumptions. We make a few assumptions to prove consistency of
neighborhood selection with the Lasso. We always assume availability of $n$
independent observations from $X \sim \mathcal{N}(0, \Sigma)$.

High-dimensionality. The number of variables is allowed to grow as the
number of observations $n$ raised to an arbitrarily high power.

Assumption 1. There exists $\gamma > 0$, so that

$p(n) = O(n^\gamma)$   for $n \to \infty$.

In particular, it is allowed for the following analysis that the number of
variables is very much larger than the number of observations, $p(n) \gg n$.

Nonsingularity. We make two regularity assumptions for the covariance
matrices.

Assumption 2. For all $a \in \Gamma(n)$ and $n \in \mathbb{N}$,
$\mathrm{Var}(X_a) = 1$. There exists $v^2 > 0$, so that for all
$n \in \mathbb{N}$ and $a \in \Gamma(n)$,

$\mathrm{Var}(X_a \mid X_{\Gamma(n) \setminus \{a\}}) \ge v^2.$

Common variance can always be achieved by appropriate scaling of the
variables. A scaling to a common (empirical) variance of all variables is
desirable, as the solutions would otherwise depend on the chosen units or
dimensions in which they are represented. The second part of the assumption
explicitly excludes singular or nearly singular covariance matrices. For
singular covariance matrices, edges are not uniquely defined by the
distribution and it is hence not surprising that nearly singular covariance
matrices are not suitable for consistent variable selection. Note, however,
that the empirical covariance matrix is a.s. singular if $p(n) > n$, which is
allowed in our analysis.

Sparsity. The main assumption is the sparsity of the graph. This entails
a restriction on the size of the neighborhoods of variables.

Assumption 3. There exists some $0 \le \kappa < 1$ so that

$\max_{a \in \Gamma(n)} |\mathrm{ne}_a| = O(n^\kappa)$   for $n \to \infty$.

This assumption limits the maximal possible rate of growth for the size 
of neighborhoods. 

For the next sparsity condition, consider again the definition in (2) of
the optimal coefficient $\theta^{b,\mathcal{A}}$ for prediction of $X_b$, given
variables in the set $\mathcal{A} \subset \Gamma(n)$.

Assumption 4. There exists some $\vartheta < \infty$ so that for all
neighboring nodes $a, b \in \Gamma(n)$ and all $n \in \mathbb{N}$,

$\|\theta^{a, \mathrm{ne}_b \setminus \{a\}}\|_1 \le \vartheta.$

This assumption is, for example, satisfied if Assumption 2 holds and the
size of the overlap of neighborhoods is bounded by an arbitrarily large number
from above for neighboring nodes. That is, if there exists some $m < \infty$
so that for all $n \in \mathbb{N}$,

(4)   $\max_{a, b \in \Gamma(n),\, b \in \mathrm{ne}_a} |\mathrm{ne}_a \cap \mathrm{ne}_b| \le m$   for $n \to \infty$,

then Assumption 4 is satisfied. To see this, note that Assumption 2 gives a
finite bound for the $\ell_2$-norm of $\theta^{a, \mathrm{ne}_b \setminus \{a\}}$,
while (4) gives a finite bound for the $\ell_0$-norm. Taken together,
Assumption 4 is implied.

Magnitude of partial correlations. The next assumption bounds the magnitude
of partial correlations from below. The partial correlation $\pi_{ab}$ between
variables $X_a$ and $X_b$ is the correlation after having eliminated the linear
effects from all remaining variables
$\{X_k; k \in \Gamma(n) \setminus \{a, b\}\}$; for details see [12].

Assumption 5. There exist a constant $\delta > 0$ and some $\xi > \kappa$,
with $\kappa$ as in Assumption 3, so that for every $(a, b) \in E$,

$|\pi_{ab}| \ge \delta n^{-(1-\xi)/2}.$

It will be shown below that Assumption 5 cannot be relaxed in general.
Note that neighborhood selection for node $a \in \Gamma(n)$ is equivalent to
simultaneously testing the null hypothesis of zero partial correlation between
variable $X_a$ and all remaining variables $X_b$,
$b \in \Gamma(n) \setminus \{a\}$. The null hypothesis of zero partial
correlation between two variables can be tested by using the corresponding
entry in the normalized inverse empirical covariance matrix. A graph estimate
based on such tests has been proposed by Drton and Perlman [4]. Such a test can
only be applied, however, if the number of variables is smaller than the number
of observations, $p(n) < n$, as the empirical covariance matrix is singular
otherwise. Even if $p(n) = n - c$ for some constant $c > 0$, Assumption 5 would
have to hold with $\xi = 1$ to have a positive power of rejecting false null
hypotheses for such an estimate; that is, partial correlations would have to be
bounded by a positive value from below.



Neighborhood stability. The last assumption is referred to as neighborhood
stability. Using the definition of $\theta^{a,\mathcal{A}}$ in (2), define for
all $a, b \in \Gamma(n)$,

(5)   $S_a(b) := \sum_{k \in \mathrm{ne}_a} \mathrm{sign}(\theta^{a,\mathrm{ne}_a}_k)\, \theta^{b,\mathrm{ne}_a}_k.$

The assumption of neighborhood stability restricts the magnitude of the
quantities $S_a(b)$ for nonneighboring nodes $a, b \in \Gamma(n)$.

Assumption 6. There exists some $\delta < 1$ so that for all
$a, b \in \Gamma(n)$ with $b \notin \mathrm{ne}_a$,

$|S_a(b)| < \delta.$



It is shown in Proposition 3 that this assumption cannot be relaxed. 
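
To make the quantity $S_a(b)$ in (5) concrete, the following Python sketch evaluates it from a known covariance matrix. The coefficients $\theta^{a,\mathcal{A}}$ of (2) are obtained by solving the normal equations $\Sigma_{\mathcal{A}\mathcal{A}}\theta = \Sigma_{\mathcal{A}a}$; the helper names, the small Gaussian-elimination solver and the AR(1)-type covariance $\Sigma_{ij} = \rho^{|i-j|}$ (whose graph is a chain) are assumptions chosen for this illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def theta_subset(Sigma, target, A):
    """Population coefficients for predicting X_target from {X_k : k in A}."""
    A = list(A)
    S_AA = [[Sigma[i][j] for j in A] for i in A]
    S_At = [Sigma[i][target] for i in A]
    return dict(zip(A, solve(S_AA, S_At)))


def stability(Sigma, a, b, ne_a):
    """S_a(b) = sum over k in ne_a of sign(theta_k^{a,ne_a}) * theta_k^{b,ne_a}."""
    th_a = theta_subset(Sigma, a, ne_a)
    th_b = theta_subset(Sigma, b, ne_a)
    sign = lambda v: (v > 0) - (v < 0)
    return sum(sign(th_a[k]) * th_b[k] for k in ne_a)


rho = 0.5
Sigma = [[rho ** abs(i - j) for j in range(4)] for i in range(4)]

# Chain 0 - 1 - 2 - 3: node 1 has neighbors {0, 2}; node 3 is a nonneighbor.
print(stability(Sigma, a=1, b=3, ne_a=[0, 2]))
```

For this chain the value stays strictly below 1 in absolute value, as one would expect for a graph without cycles.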

We give in the following a more intuitive condition which essentially
implies Assumption 6. This will justify the term neighborhood stability.
Consider the definition in (1) of the optimal coefficients $\theta^a$ for
prediction of $X_a$. For $\eta > 0$, define $\theta^a(\eta)$ as the optimal set of
coefficients under an additional $\ell_1$-penalty,

(6)   $\theta^a(\eta) := \arg\min_{\theta : \theta_a = 0} E\bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\bigr)^2 + \eta \|\theta\|_1.$

The neighborhood $\mathrm{ne}_a$ of node $a$ was defined as the set of nonzero
coefficients of $\theta^a$,
$\mathrm{ne}_a = \{k \in \Gamma(n) : \theta^a_k \ne 0\}$. Define the disturbed
neighborhood $\mathrm{ne}_a(\eta)$ as

$\mathrm{ne}_a(\eta) := \{k \in \Gamma(n) : \theta^a_k(\eta) \ne 0\}.$

It clearly holds that $\mathrm{ne}_a = \mathrm{ne}_a(0)$. The assumption of
neighborhood stability is satisfied if there exists some infinitesimally small
perturbation $\eta$, which may depend on $n$, so that the disturbed
neighborhood $\mathrm{ne}_a(\eta)$ is identical to the undisturbed neighborhood
$\mathrm{ne}_a(0)$.

Proposition 2. If there exists some $\eta > 0$ so that
$\mathrm{ne}_a(\eta) = \mathrm{ne}_a(0)$, then $|S_a(b)| \le 1$ for all
$b \in \Gamma(n) \setminus \mathrm{ne}_a$.

A proof is given in the Appendix. 

In light of Proposition 2 it seems that Assumption 6 is a very weak
condition. To give one example, Assumption 6 is automatically satisfied under
the much stronger assumption that the graph does not contain cycles. We
give a brief reasoning for this. Consider two nonneighboring nodes $a$ and $b$.
If the nodes are in different connectivity components, there is nothing left
to show as $S_a(b) = 0$. If they are in the same connectivity component, then
there exists one node $k \in \mathrm{ne}_a$ that separates $b$ from
$\mathrm{ne}_a \setminus \{k\}$, as there is just one unique path between any
two variables in the same connectivity component if the graph does not contain
cycles. Using the global Markov property, the random variable $X_b$ is
independent of $X_{\mathrm{ne}_a \setminus \{k\}}$, given $X_k$. The random
variable $E(X_b \mid X_{\mathrm{ne}_a})$ is thus a function of $X_k$ only. As the
distribution is Gaussian,
$E(X_b \mid X_{\mathrm{ne}_a}) = \theta^{b,\mathrm{ne}_a}_k X_k$. By Assumption 2,
$\mathrm{Var}(X_b \mid X_{\mathrm{ne}_a}) = v^2$ for some $v^2 > 0$. It follows
that $\mathrm{Var}(X_b) = v^2 + (\theta^{b,\mathrm{ne}_a}_k)^2 = 1$ and hence
$|\theta^{b,\mathrm{ne}_a}_k| = \sqrt{1 - v^2} < 1$, which implies that
Assumption 6 is indeed satisfied if the graph does not contain cycles.

We mention that Assumption 6 is likewise satisfied if the inverse covariance
matrices $\Sigma^{-1}(n)$ are for each $n \in \mathbb{N}$ diagonally dominant.
A matrix
is said to be diagonally dominant if and only if, for each row, the sum of 
the absolute values of the nondiagonal elements is smaller than the abso- 
lute value of the diagonal element. The proof of this is straightforward but 
tedious and hence is omitted. 

2.4. Controlling type I errors. The asymptotic properties of Lasso-type
estimates in regression have been studied in detail by Knight and Fu [11] for
a fixed number of variables. Their results say that the penalty parameter
$\lambda$ should decay for an increasing number of observations at least as
fast as $n^{-1/2}$ to obtain an $n^{1/2}$-consistent estimate. It turns out
that a slower rate is needed for consistent model selection in the
high-dimensional case where $p(n) \gg n$. However, a rate
$n^{-(1-\varepsilon)/2}$ with any $\kappa < \varepsilon < \xi$ (where $\kappa$,
$\xi$ are defined as in Assumptions 3 and 5) is sufficient for consistent
neighborhood selection, even when the number of variables is growing rapidly
with the number of observations.
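
A short numerical sketch may help to see how much slower the rate $n^{-(1-\varepsilon)/2}$ decays than $n^{-1/2}$. The function name `penalty_rate`, the constant $d = 1$ and the value $\varepsilon = 0.5$ are assumptions picked for illustration; any $\varepsilon$ strictly between $\kappa$ and $\xi$ would serve.

```python
def penalty_rate(n, eps, d=1.0):
    """Penalty sequence lambda_n = d * n^{-(1 - eps) / 2}."""
    return d * n ** (-(1.0 - eps) / 2.0)


for n in (100, 10_000, 1_000_000):
    lam = penalty_rate(n, eps=0.5)
    # The ratio to the n^{-1/2} rate equals d * n^{eps / 2} and grows with n.
    print(n, lam, lam * n ** 0.5)
```

The penalty still tends to zero, but its ratio to $n^{-1/2}$ diverges, which is the sense in which a slower rate is used.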

Theorem 1. Let Assumptions 1-6 hold. Let the penalty parameter satisfy
$\lambda_n \sim d n^{-(1-\varepsilon)/2}$ with some $\kappa < \varepsilon < \xi$
and $d > 0$. There exists some $c > 0$ so that, for all $a \in \Gamma(n)$,

$P(\widehat{\mathrm{ne}}_a^\lambda \subseteq \mathrm{ne}_a) = 1 - O(\exp(-cn^\varepsilon))$   for $n \to \infty$.

A proof is given in the Appendix.

Theorem 1 states that the probability of (falsely) including any of the
nonneighboring variables of the node $a \in \Gamma(n)$ into the neighborhood
estimate vanishes exponentially fast, even though the number of nonneighboring
variables may grow very rapidly with the number of observations. It is shown
in the following that Assumption 6 cannot be relaxed.

Proposition 3. If there exists some $a, b \in \Gamma(n)$ with
$b \notin \mathrm{ne}_a$ and $|S_a(b)| > 1$, then, for $\lambda = \lambda_n$ as in
Theorem 1,

$P(\widehat{\mathrm{ne}}_a^\lambda \subseteq \mathrm{ne}_a) \to 0$   for $n \to \infty$.

A proof is given in the Appendix. Assumption 6 of neighborhood stability 
is hence critical for the success of Lasso neighborhood selection. 

2.5. Controlling type II errors. So far it has been shown that the
probability of falsely including variables into the neighborhood can be
controlled by the Lasso. The question arises whether the probability of
including all neighboring variables into the neighborhood estimate converges
to 1 for $n \to \infty$.

Theorem 2. Let the assumptions of Theorem 1 be satisfied. For
$\lambda = \lambda_n$ as in Theorem 1, for some $c > 0$,

$P(\mathrm{ne}_a \subseteq \widehat{\mathrm{ne}}_a^\lambda) = 1 - O(\exp(-cn^\varepsilon))$   for $n \to \infty$.

A proof is given in the Appendix.

It may be of interest whether Assumption 5 could be relaxed, so that edges
are detected even if the partial correlation is vanishing at a rate
$n^{-(1-\xi)/2}$ for some $\xi < \kappa$. The following proposition says that
$\xi > \varepsilon$ (and thus $\xi > \kappa$ as $\varepsilon > \kappa$) is a
necessary condition if a stronger version of Assumption 4 holds, which is
satisfied for forests and trees, for example.

Proposition 4. Let the assumptions of Theorem 1 hold with $\vartheta < 1$
in Assumption 4, except that for $a \in \Gamma(n)$, let there be some
$b \in \Gamma(n) \setminus \{a\}$ with $\pi_{ab} \ne 0$ and
$|\pi_{ab}| = O(n^{-(1-\xi)/2})$ for $n \to \infty$ for some
$\xi < \varepsilon$. Then

$P(b \in \widehat{\mathrm{ne}}_a^\lambda) \to 0$   for $n \to \infty$.

Theorem 2 and Proposition 4 say that edges between nodes for which partial
correlation vanishes at a rate $n^{-(1-\xi)/2}$ are, with probability
converging to 1 for $n \to \infty$, detected if $\xi > \varepsilon$ and are
undetected if $\xi < \varepsilon$. The results do not cover the case
$\xi = \varepsilon$, which remains a challenging question for further research.

All results so far have treated the distinction between zero and nonzero
partial correlations only. The signs of partial correlations of neighboring
nodes can be estimated consistently under the same assumptions and with
the same rates, as can be seen in the proofs.

3. Covariance selection. It follows from Section 2 that it is possible under
certain conditions to estimate the neighborhood of each node in the graph
consistently, for example,

$P(\widehat{\mathrm{ne}}_a^\lambda = \mathrm{ne}_a) \to 1$   for $n \to \infty$.

The full graph is given by the set $\Gamma(n)$ of nodes and the edge set
$E = E(n)$. The edge set contains those pairs
$(a, b) \in \Gamma(n) \times \Gamma(n)$ for which the partial correlation
between $X_a$ and $X_b$ is not zero. As the partial correlations are
precisely nonzero for neighbors, the edge set
$E \subseteq \Gamma(n) \times \Gamma(n)$ is given by

$E = \{(a, b) : a \in \mathrm{ne}_b \wedge b \in \mathrm{ne}_a\}.$

The first condition, $a \in \mathrm{ne}_b$, implies in fact the second,
$b \in \mathrm{ne}_a$, and vice versa, so that the edge set is as well identical
to $\{(a, b) : a \in \mathrm{ne}_b \vee b \in \mathrm{ne}_a\}$. For an estimate of
the edge set of a graph, we can apply neighborhood selection to each node in
the graph. A natural estimate of the edge set is then given by
$\hat{E}^{\lambda,\wedge} \subseteq \Gamma(n) \times \Gamma(n)$, where

(7)   $\hat{E}^{\lambda,\wedge} = \{(a, b) : a \in \widehat{\mathrm{ne}}_b^\lambda \wedge b \in \widehat{\mathrm{ne}}_a^\lambda\}.$

Note that $a \in \widehat{\mathrm{ne}}_b^\lambda$ does not necessarily imply
$b \in \widehat{\mathrm{ne}}_a^\lambda$ and vice versa. We can hence also define
a second, less conservative, estimate of the edge set by

(8)   $\hat{E}^{\lambda,\vee} = \{(a, b) : a \in \widehat{\mathrm{ne}}_b^\lambda \vee b \in \widehat{\mathrm{ne}}_a^\lambda\}.$

The discrepancies between the estimates (7) and (8) are quite small in our
experience. Asymptotically the difference between both estimates vanishes,
as seen in the following corollary. We refer to both edge set estimates
collectively with the generic notation $\hat{E}^\lambda$, as the following
result holds for both of them.
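
The two combination rules (7) and (8) can be sketched in a few lines of Python. The function name `edge_sets`, the representation of edges as ordered tuples `(a, b)` with `a < b`, and the toy neighborhood estimates are assumptions made for this illustration.

```python
def edge_sets(ne_hat):
    """Combine per-node neighborhood estimates into the AND (7) and OR (8) edge sets.

    ne_hat maps each node to its estimated neighborhood (a set of nodes).
    Each undirected edge is stored once as a tuple (a, b) with a < b.
    """
    nodes = sorted(ne_hat)
    e_and, e_or = set(), set()
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            a_in_b = a in ne_hat[b]
            b_in_a = b in ne_hat[a]
            if a_in_b and b_in_a:
                e_and.add((a, b))
            if a_in_b or b_in_a:
                e_or.add((a, b))
    return e_and, e_or


# Toy estimates: node 1 selects node 2, but node 2 does not select node 1.
ne_hat = {0: {1}, 1: {0, 2}, 2: set()}
e_and, e_or = edge_sets(ne_hat)
print(e_and)
print(e_or)
```

The AND rule keeps only edges confirmed from both sides, so its edge set is always contained in that of the OR rule.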

Corollary 1. Under the conditions of Theorem 2, for some $c > 0$,

$P(\hat{E}^\lambda = E) = 1 - O(\exp(-cn^\varepsilon))$   for $n \to \infty$.

The claim follows since $|\Gamma(n)|^2 = p(n)^2 = O(n^{2\gamma})$ by
Assumption 1 and neighborhood selection has an exponentially fast convergence
rate as described by Theorem 2. Corollary 1 says that the conditional
independence structure of a multivariate normal distribution can be estimated
consistently by combining the neighborhood estimates for all variables.

Note that there are in total $2^{(p^2 - p)/2}$ distinct graphs for a
$p$-dimensional variable. However, for each of the $p$ nodes there are only
$2^{p-1}$ distinct potential neighborhoods. By breaking the graph selection
problem into a consecutive series of neighborhood selection problems, the
complexity of the search is thus reduced substantially at the price of
potential inconsistencies between neighborhood estimates. Graph estimates that
apply this strategy for complexity reduction are sometimes called dependency
networks [9]. The complexity of the proposed neighborhood selection for one
node with the Lasso is reduced further to $O(np \min\{n, p\})$, as the Lars
procedure of Efron, Hastie, Johnstone and Tibshirani [6] requires
$O(\min\{n, p\})$ steps, each of complexity $O(np)$. For high-dimensional
problems as in Theorems 1 and 2, where the number of variables grows as
$p(n) \sim cn^\gamma$ for some $c > 0$ and $\gamma > 1$, this is equivalent to
$O(p^{2+2/\gamma})$ computations for the whole graph. The complexity of the
proposed method thus scales approximately quadratically with the number of
nodes for large values of $\gamma$.
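
The counting argument and the cost exponent can be checked with a few lines of Python. The function names are assumptions introduced for this sketch. The exponent follows from multiplying the per-node cost $O(np\min\{n,p\}) = O(n^2 p)$ (for $p > n$) by the $p$ nodes and substituting $n \sim p^{1/\gamma}$.

```python
def n_graphs(p):
    """Number of distinct undirected graphs on p nodes: 2^{(p^2 - p) / 2}."""
    return 2 ** (p * (p - 1) // 2)


def n_neighborhood_candidates(p):
    """Total number of candidate neighborhoods over all p nodes: p * 2^{p - 1}."""
    return p * 2 ** (p - 1)


def graph_cost_exponent(gamma):
    """Exponent e in O(p^e) for the whole graph when p ~ c * n^gamma, gamma > 1.

    Per node: n * p * min(n, p) = n^2 * p; over p nodes: n^2 * p^2 = p^(2 + 2 / gamma).
    """
    return 2.0 + 2.0 / gamma


print(n_graphs(10), n_neighborhood_candidates(10))
print(graph_cost_exponent(2.0), graph_cost_exponent(10.0))
```

As $\gamma$ grows, the exponent approaches 2, which is the approximately quadratic scaling mentioned above.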

Before providing some numerical results, we discuss in the following the 
choice of the penalty parameter. 



Finite-sample results and significance. It was shown above that consis- 
tent neighborhood and covariance selection is possible with the Lasso in a 
high-dimensional setting. However, the asymptotic considerations give little 
advice on how to choose a specific penalty parameter for a given problem. 
Ideally, one would like to guarantee that pairs of variables which are not 
contained in the edge set enter the estimate of the edge set only with very 
low (prespecified) probability. Unfortunately, it seems very difficult to obtain 
such a result as the probability of falsely including a pair of variables into 
the estimate of the edge set depends on the exact covariance matrix, which 
is in general unknown. It is possible, however, to constrain the probability of 
(falsely) connecting two distinct connectivity components of the true graph. 
The connectivity component $C_a \subseteq \Gamma(n)$ of a node
$a \in \Gamma(n)$ is the set of nodes which are connected to node $a$ by a
chain of edges. The neighborhood $\mathrm{ne}_a$ is clearly part of the
connectivity component $C_a$.

Let $\hat{C}_a^\lambda$ be the connectivity component of $a$ in the estimated
graph $(\Gamma, \hat{E}^\lambda)$. For any level $0 < \alpha < 1$, consider the
choice of the penalty

(9)   $\lambda(\alpha) = \frac{2\hat\sigma_a}{\sqrt{n}}\, \tilde\Phi^{-1}\Bigl(\frac{\alpha}{2p(n)^2}\Bigr),$

where $\tilde\Phi = 1 - \Phi$ [$\Phi$ is the c.d.f. of $\mathcal{N}(0,1)$] and
$\hat\sigma_a^2 = n^{-1}\langle \mathbf{X}_a, \mathbf{X}_a \rangle$. The
probability of falsely joining two distinct connectivity components with the
estimate of the edge set is bounded by the level $\alpha$ under the choice
$\lambda = \lambda(\alpha)$ of the penalty parameter, as shown in the following
theorem.

Theorem 3. Let Assumptions 1-6 be satisfied. Using the penalty parameter
$\lambda(\alpha)$, we have for all $n \in \mathbb{N}$ that

$P(\exists a \in \Gamma(n) : \hat{C}_a^\lambda \not\subseteq C_a) \le \alpha.$

A proof is given in the Appendix. This implies that if the edge set is
empty ($E = \emptyset$), it is estimated by an empty set with high probability,

$P(\hat{E}^\lambda = \emptyset) \ge 1 - \alpha.$

Theorem 3 is a finite-sample result. The previous asymptotic results in
Theorems 1 and 2 hold if the level $\alpha$ vanishes exponentially to zero for
an increasing number of observations, leading to consistent edge set estimation.
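
As a hedged illustration of how a penalty of the form (9) could be evaluated, the Python sketch below assumes
$\lambda(\alpha) = 2\hat\sigma_a n^{-1/2}\, \tilde\Phi^{-1}(\alpha / (2p(n)^2))$ with $\tilde\Phi = 1 - \Phi$. The function name `penalty`, the default $\hat\sigma_a = 1$ (variables scaled to unit empirical variance) and the sample sizes in the usage line are assumptions for this example. It relies on `statistics.NormalDist.inv_cdf` from the Python standard library and on the symmetry $\tilde\Phi^{-1}(q) = -\Phi^{-1}(q)$.

```python
from math import sqrt
from statistics import NormalDist


def penalty(alpha, n, p, sigma_hat=1.0):
    """lambda(alpha) = 2 * sigma_hat / sqrt(n) * Phi_tilde^{-1}(alpha / (2 * p^2))."""
    q = alpha / (2.0 * p * p)
    # Phi_tilde = 1 - Phi, so by symmetry Phi_tilde^{-1}(q) = -Phi^{-1}(q).
    upper_quantile = -NormalDist().inv_cdf(q)
    return 2.0 * sigma_hat / sqrt(n) * upper_quantile


# Hypothetical setting: level 0.05, 600 observations, 1000 nodes.
print(penalty(alpha=0.05, n=600, p=1000))
```

Under this form the penalty grows only slowly (through a Gaussian quantile) with the number of nodes and shrinks like $n^{-1/2}$ in the number of observations.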

4. Numerical examples. We use both the Lasso estimate from Section 3 
and forward selection MLE [5, 12] to estimate sparse graphs. We found it 
difficult to compare numerically neighborhood selection with forward selec- 
tion MLE for more than 30 nodes in the graph. The high computational 
complexity of the forward selection MLE made the computations for such 
relatively low-dimensional problems very costly already. The Lasso scheme 
in contrast handled with ease graphs with more than 1000 nodes, using the 
recent algorithm developed in [6]. Where comparison was feasible, the per- 
formance of the neighborhood selection scheme was better. The difference 
was particularly pronounced if the ratio of observations to variables was low, 
as can be seen in Table 1, which will be described in more detail below. 

First we give an account of the generation of the underlying graphs which 
we are trying to estimate. A realization of an underlying (random) graph is 
given in the left panel of Figure 1. The nodes of the graph are associated 
with spatial location and the location of each node is distributed identically
and uniformly in the two-dimensional square $[0,1]^2$. Every pair of nodes is
included initially in the edge set with probability $\varphi(d/\sqrt{p})$,
where $d$ is the

Table 1

The average number of correctly identified edges as a function of the number k of falsely
included edges for n = 40 observations and p = 10, 20, 30 nodes for forward selection
MLE (FS), $\hat{E}^{\lambda,\vee}$, $\hat{E}^{\lambda,\wedge}$ and random guessing

                                    p = 10               p = 20               p = 30
k                                 0     5    10        0     5    10        0     5    10
Random                          0.2   1.9   3.7      0.1   0.7   1.4      0.1   0.5   0.9
FS                              7.6  14.1  17.1      8.9  16.6  21.6      0.6   1.8   3.2
$\hat{E}^{\lambda,\vee}$        8.2  15.0  17.6      9.3  18.5  23.9     11.4  21.4  26.3
$\hat{E}^{\lambda,\wedge}$      8.5  14.7  17.6      9.5  19.1  34.0     14.1  21.4  27.4

Fig. 1. A realization of a graph is shown on the left, generated as described in the text.
The graph consists of 1000 nodes and 1747 edges out of 499,500 distinct pairs of variables.
The estimated edge set, using estimate (7) at level $\alpha = 0.05$ [see (9)], is shown in the
middle. There are two erroneously included edges, marked by an arrow, while 1109 edges
are correctly detected. For estimate (8) and an adjusted level as described in the text, the
result is shown on the right. Again two edges are erroneously included. Not a single pair
of disjoint connectivity components of the true graph has been (falsely) joined by either
estimate.

Euclidean distance between the pair of variables and $\varphi$ is the density of
the standard normal distribution. The maximum number of edges connecting
to each node is limited to four to achieve the desired sparsity of the
graph. Edges which connect to nodes which do not satisfy this constraint
are removed randomly until the constraint is satisfied for all edges. Initially
all variables have identical conditional variance and the partial correlation
between neighbors is set to 0.245 (absolute values less than 0.25 guarantee
positive definiteness of the inverse covariance matrix); that is, $\Sigma^{-1}_{aa} = 1$ for all
nodes $a \in \Gamma$, $\Sigma^{-1}_{ab} = 0.245$ if there is an edge connecting $a$ and $b$, and
$\Sigma^{-1}_{ab} = 0$ otherwise. The diagonal elements of the corresponding covariance matrix
are in general larger than 1. To achieve constant variance, all variables are
finally rescaled so that the diagonal elements of $\Sigma$ are all unity. Using the
Cholesky transformation of the covariance matrix, $n$ independent samples
are drawn from the corresponding Gaussian distribution.
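
The generation steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the code used for the reported simulations: the function name `simulate_sparse_gaussian`, its arguments and the seed handling are hypothetical, and the pruning rule (pick a node with too many edges at random, then drop one of its edges at random) is one possible reading of "removed randomly".

```python
import numpy as np
from math import sqrt, pi, exp


def simulate_sparse_gaussian(p, n, max_degree=4, rho=0.245, seed=0):
    """Random spatial graph -> inverse covariance -> unit-variance covariance -> samples."""
    rng = np.random.default_rng(seed)

    # Node locations, uniform on the unit square.
    loc = rng.uniform(size=(p, 2))

    # Include each pair with probability phi(d / sqrt(p)), phi = N(0,1) density.
    adj = np.zeros((p, p), dtype=bool)
    for a in range(p):
        for b in range(a + 1, p):
            d = np.linalg.norm(loc[a] - loc[b])
            prob = exp(-0.5 * (d / sqrt(p)) ** 2) / sqrt(2 * pi)
            if rng.uniform() < prob:
                adj[a, b] = adj[b, a] = True

    # Randomly remove edges at over-full nodes until every degree <= max_degree.
    while True:
        over = np.flatnonzero(adj.sum(axis=1) > max_degree)
        if over.size == 0:
            break
        a = rng.choice(over)
        b = rng.choice(np.flatnonzero(adj[a]))
        adj[a, b] = adj[b, a] = False

    # Inverse covariance: unit diagonal, rho on edges, zero elsewhere.
    # With at most four edges per node, 4 * 0.245 < 1 gives diagonal dominance.
    K = np.eye(p) + rho * adj

    # Covariance, rescaled so that all variances equal one.
    sigma = np.linalg.inv(K)
    scale = 1.0 / np.sqrt(np.diag(sigma))
    sigma = sigma * np.outer(scale, scale)

    # Draw n samples via the Cholesky factor: rows of Z @ L.T have covariance sigma.
    L = np.linalg.cholesky(sigma)
    X = rng.standard_normal((n, p)) @ L.T
    return adj, sigma, X
```

For example, `adj, sigma, X = simulate_sparse_gaussian(p=30, n=40)` would give a setting comparable in size to the last column block of Table 1.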

The average number of edges which are correctly included into the es- 
timate of the edge set is shown in Table 1 as a function of the number of 
edges which are falsely included. The accuracy of the forward selection MLE 
is comparable to the proposed Lasso neighborhood selection if the number 
of nodes is much smaller than the number of observations. The accuracy of 
the forward selection MLE breaks down, however, if the number of nodes is 
approximately equal to the number of observations. Forward selection MLE 
is only marginally better than random guessing in this case. Computation 
of the forward selection MLE (using MIM, [5]) on the same desktop took 
up to several hundred times longer than the Lasso neighborhood selection 
for the full graph. For more than 30 nodes, the differences are even more 
pronounced. 






The Lasso neighborhood selection can be applied to hundred- or thousand-dimensional
graphs, a realistic size for, for example, biological networks. A graph
with 1000 nodes (following the same model as described above) and its
estimates (7) and (8), using 600 observations, are shown in Figure 1. A level
$\alpha = 0.05$ is used for the estimate $\hat{E}^{\lambda,\vee}$. For better comparison, the level $\alpha$
was adjusted to $\alpha = 0.064$ for the estimate $\hat{E}^{\lambda,\wedge}$, so that both estimates
lead to the same number of included edges. There are two erroneous edge
inclusions, while 1109 out of all 1747 edges have been correctly identified
by either estimate. Of these 1109 edges, 907 are common to both estimates
while 202 are present in only one of (7) or (8).
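
To make the two edge estimates concrete, the following sketch runs a Lasso regression of every node on all remaining nodes and then combines the selected neighborhoods with an AND rule and an OR rule. It is a minimal illustration under assumptions: `lasso_cd` is a plain coordinate-descent solver written for this example (not the algorithm used in the paper), it minimizes $n^{-1}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$, and the function names and the fixed number of sweeps are hypothetical choices.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for  n^{-1} ||y - X theta||_2^2 + lam ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    resid = y.astype(float).copy()
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # n^{-1} <X_j, partial residual with coordinate j removed>
            rho = X[:, j] @ resid / n + col_ss[j] * theta[j]
            # Soft-thresholding at lam / 2 (the loss has no factor 1/2).
            new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
            resid += X[:, j] * (theta[j] - new)
            theta[j] = new
    return theta


def neighborhood_selection(X, lam):
    """Return the AND and OR edge estimates as boolean adjacency matrices."""
    n, p = X.shape
    nbr = np.zeros((p, p), dtype=bool)  # nbr[a, b]: b selected as neighbor of a
    for a in range(p):
        others = np.delete(np.arange(p), a)
        theta = lasso_cd(X[:, others], X[:, a], lam)
        nbr[a, others] = theta != 0.0
    return nbr & nbr.T, nbr | nbr.T
```

The AND estimate is always a subset of the OR estimate, so comparing the two at penalties tuned to give equal edge counts, as done above, isolates the edges on which the node-wise regressions disagree.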

To examine whether results are critically dependent on the assumption of Gaussianity,
long-tailed noise is added to the observations. Instead of $n$ i.i.d. observations
of $X \sim \mathcal{N}(0, \Sigma)$, $n$ i.i.d. observations of $X + 0.1Z$ are made, where
the components of $Z$ are independent and follow a $t_2$-distribution. For 10
simulations (each with 500 observations), the proportion of false rejections
among all rejections increases only slightly from 0.8% (without long-tailed
noise) to 1.4% (with long-tailed noise) for $\hat{E}^{\lambda,\vee}$ and from 4.8% to 5.2% for
$\hat{E}^{\lambda,\wedge}$. Our limited numerical experience suggests that the properties of the
graph estimator do not seem to be critically affected by deviations from
Gaussianity.
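
The perturbation and the error measure used in this check are simple to state in code. The sketch below is illustrative only: the helper names are hypothetical, and the proportion of false rejections is computed here as the share of estimated edges that are absent from the true graph, counting each undirected pair once.

```python
import numpy as np


def add_long_tailed_noise(X, scale=0.1, df=2, seed=0):
    """Return X + scale * Z with independent t_df-distributed entries in Z."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_t(df, size=X.shape)
    return X + scale * Z


def false_rejection_proportion(est_adj, true_adj):
    """Share of estimated edges that are not edges of the true graph."""
    est = np.triu(est_adj, k=1)               # count each pair once
    false = est & ~np.triu(true_adj, k=1)
    total = est.sum()
    return false.sum() / total if total > 0 else 0.0
```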

APPENDIX: PROOFS 

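A.1. Notation and useful lemmas. As a generalization of (3), the Lasso
estimate $\hat\theta^{a,A,\lambda}$ of $\theta^{a,A}$, defined in (2), is given by

(A.1)  $$\hat\theta^{a,A,\lambda} = \arg\min_{\theta:\,\theta_k = 0\ \forall k \notin A} \bigl(n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1\bigr).$$

The notation $\hat\theta^{a,\lambda}$ is thus just a shorthand notation for $\hat\theta^{a,\Gamma(n)\setminus\{a\},\lambda}$.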

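Lemma A.1. Given $\theta \in \mathbb{R}^{p(n)}$, let $G(\theta)$ be a $p(n)$-dimensional vector
with elements

$$G_b(\theta) = -2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\theta, \mathbf{X}_b\rangle.$$

A vector $\hat\theta$ with $\hat\theta_k = 0$, $\forall k \in \Gamma(n)\setminus A$ is a solution to (A.1) iff for all $b \in A$,
$G_b(\hat\theta) = -\operatorname{sign}(\hat\theta_b)\lambda$ in case $\hat\theta_b \neq 0$ and $|G_b(\hat\theta)| \le \lambda$ in case $\hat\theta_b = 0$. Moreover,
if the solution is not unique and $|G_b(\hat\theta)| < \lambda$ for some solution $\hat\theta$, then $\hat\theta_b = 0$
for all solutions of (A.1).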

Proof. Denote the subdifferential of

$$n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1$$

with respect to $\theta$ by $D(\theta)$. The vector $\hat\theta$ is a solution to (A.1) iff there
exists an element $d \in D(\hat\theta)$ so that $d_b = 0$, $\forall b \in A$. $D(\theta)$ is given by $\{G(\theta) + \lambda e,\ e \in S\}$,
where $S \subset \mathbb{R}^{p(n)}$ is given by $S := \{e \in \mathbb{R}^{p(n)} : e_b = \operatorname{sign}(\theta_b)$ if $\theta_b \neq 0$
and $e_b \in [-1,1]$ if $\theta_b = 0\}$. The first part of the claim follows. The second
part follows from the proof of Theorem 3.1 in [13]. □
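
The optimality conditions of Lemma A.1 can be checked numerically for any candidate solution. The sketch below is an illustration under assumptions: `lasso_cd` is a simple coordinate-descent solver written for this example, `satisfies_kkt` is a hypothetical helper name, and the conditions are tested up to a numerical tolerance `tol`. The response `y` plays the role of $\mathbf{X}_a$ and the columns of `X` the role of the variables in $A$.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate descent for  n^{-1} ||y - X theta||_2^2 + lam ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    resid = y.astype(float).copy()
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            rho = X[:, j] @ resid / n + col_ss[j] * theta[j]
            new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
            resid += X[:, j] * (theta[j] - new)
            theta[j] = new
    return theta


def gradient(X, y, theta):
    """G(theta) with elements G_b = -2 n^{-1} <y - X theta, X_b>."""
    n = X.shape[0]
    return -2.0 * X.T @ (y - X @ theta) / n


def satisfies_kkt(X, y, theta, lam, tol=1e-6):
    """Check G_b = -sign(theta_b) lam on the support and |G_b| <= lam elsewhere."""
    G = gradient(X, y, theta)
    active = theta != 0.0
    ok_active = np.allclose(G[active], -np.sign(theta[active]) * lam, atol=tol)
    ok_inactive = np.all(np.abs(G[~active]) <= lam + tol)
    return bool(ok_active and ok_inactive)
```

A converged solver output should pass the check, while an arbitrary vector such as the zero vector typically fails it whenever some $|G_b(0)|$ exceeds $\lambda$.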

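Lemma A.2. Let $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ be defined for every $a \in \Gamma(n)$ as in (A.1). Under
the assumptions of Theorem 1, for some $c > 0$, for all $a \in \Gamma(n)$,

$$P\bigl(\operatorname{sign}(\hat\theta_b^{a,\mathrm{ne}_a,\lambda}) = \operatorname{sign}(\theta_b^a),\ \forall b \in \mathrm{ne}_a\bigr) = 1 - O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty.$$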

For the sign-function, it is understood that sign(0) = 0. The lemma says, 
in other words, that if one could restrict the Lasso estimate to have zero co- 
efficients for all nodes which are not in the neighborhood of node a, then the 
signs of the partial correlations in the neighborhood of node a are estimated 
consistently under the given assumptions. 

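Proof. Using Bonferroni's inequality, and $|\mathrm{ne}_a| = o(n)$ for $n \to \infty$, it
suffices to show that there exists some $c > 0$ so that for every $a, b \in \Gamma(n)$
with $b \in \mathrm{ne}_a$,

$$P\bigl(\operatorname{sign}(\hat\theta_b^{a,\mathrm{ne}_a,\lambda}) = \operatorname{sign}(\theta_b^a)\bigr) = 1 - O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty.$$

Consider the definition of $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ in (A.1),

(A.2)  $$\hat\theta^{a,\mathrm{ne}_a,\lambda} = \arg\min_{\theta:\,\theta_k = 0\ \forall k \notin \mathrm{ne}_a} \bigl(n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1\bigr).$$

Assume now that component $b$ of this estimate is fixed at a constant value
$\beta$. Denote this new estimate by $\tilde\theta^{a,b,\lambda}(\beta)$,

(A.3)  $$\tilde\theta^{a,b,\lambda}(\beta) = \arg\min_{\theta \in \Theta_{a,b}(\beta)} \bigl(n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1\bigr),$$

where

$$\Theta_{a,b}(\beta) := \{\theta \in \mathbb{R}^{p(n)} : \theta_b = \beta;\ \theta_k = 0,\ \forall k \notin \mathrm{ne}_a\}.$$

There always exists a value $\beta$ (namely $\beta = \hat\theta_b^{a,\mathrm{ne}_a,\lambda}$) so that $\tilde\theta^{a,b,\lambda}(\beta)$ is identical
to $\hat\theta^{a,\mathrm{ne}_a,\lambda}$. Thus, if $\operatorname{sign}(\hat\theta_b^{a,\mathrm{ne}_a,\lambda}) \neq \operatorname{sign}(\theta_b^a)$, there would exist some $\beta$
with $\operatorname{sign}(\beta)\operatorname{sign}(\theta_b^a) \le 0$ so that $\tilde\theta^{a,b,\lambda}(\beta)$ would be a solution to (A.2). Using
$\operatorname{sign}(\theta_b^a) \neq 0$ for all $b \in \mathrm{ne}_a$, it is thus sufficient to show that for every $\beta$
with $\operatorname{sign}(\beta)\operatorname{sign}(\theta_b^a) \le 0$, $\tilde\theta^{a,b,\lambda}(\beta)$ cannot be a solution to (A.2) with high
probability.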

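We focus in the following on the case where $\theta_b^a > 0$ for notational simplicity.
The case $\theta_b^a < 0$ follows analogously. If $\theta_b^a > 0$, it follows by Lemma A.1
that $\tilde\theta^{a,b,\lambda}(\beta)$ with $\tilde\theta_b^{a,b,\lambda}(\beta) = \beta \le 0$ can be a solution to (A.2) only if
$G_b(\tilde\theta^{a,b,\lambda}(\beta)) \ge -\lambda$. Hence it suffices to show that for some $c > 0$ and all
$b \in \mathrm{ne}_a$ with $\theta_b^a > 0$, for $n \to \infty$,

(A.4)  $$P\Bigl(\sup_{\beta \le 0}\{G_b(\tilde\theta^{a,b,\lambda}(\beta))\} < -\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$

Let in the following $\mathbf{R}_a^\lambda(\beta)$ be the $n$-dimensional vector of residuals,

(A.5)  $$\mathbf{R}_a^\lambda(\beta) := \mathbf{X}_a - \mathbf{X}\tilde\theta^{a,b,\lambda}(\beta).$$

We can write $X_b$ as

(A.6)  $$X_b = \sum_{k \in \mathrm{ne}_a\setminus\{b\}} \theta_k^{b,\mathrm{ne}_a\setminus\{b\}} X_k + W_b,$$

where $W_b$ is independent of $\{X_k;\ k \in \mathrm{ne}_a\setminus\{b\}\}$. By straightforward calculation,
using (A.6),

$$G_b(\tilde\theta^{a,b,\lambda}(\beta)) = -2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b\rangle - \sum_{k \in \mathrm{ne}_a\setminus\{b\}} \theta_k^{b,\mathrm{ne}_a\setminus\{b\}} \bigl(2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{X}_k\rangle\bigr).$$

By Lemma A.1, for all $k \in \mathrm{ne}_a\setminus\{b\}$, $|G_k(\tilde\theta^{a,b,\lambda}(\beta))| = |2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{X}_k\rangle| \le \lambda$.
This together with the equation above yields

(A.7)  $$G_b(\tilde\theta^{a,b,\lambda}(\beta)) \le -2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b\rangle + \lambda\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1.$$

Using Assumption 4, there exists some $\vartheta < \infty$, so that $\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1 \le \vartheta$. For
proving (A.4) it is therefore sufficient to show that there exists for every
$g > 0$ some $c > 0$ so that for all $b \in \mathrm{ne}_a$ with $\theta_b^a > 0$, for $n \to \infty$,

(A.8)  $$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b\rangle\} > g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$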

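With a little abuse of notation, let $\mathcal{W}^{\parallel} \subseteq \mathbb{R}^n$ be the at most $(|\mathrm{ne}_a| - 1)$-dimensional
space which is spanned by the vectors $\{\mathbf{X}_k,\ k \in \mathrm{ne}_a\setminus\{b\}\}$ and
let $\mathcal{W}^{\perp}$ be the orthogonal complement of $\mathcal{W}^{\parallel}$ in $\mathbb{R}^n$. Split the $n$-dimensional
vector $\mathbf{W}_b$ of observations of $W_b$ into the sum of two vectors

(A.9)  $$\mathbf{W}_b = \mathbf{W}_b^{\perp} + \mathbf{W}_b^{\parallel},$$

where $\mathbf{W}_b^{\parallel}$ is contained in the space $\mathcal{W}^{\parallel} \subseteq \mathbb{R}^n$, while the remaining part
$\mathbf{W}_b^{\perp}$ is chosen orthogonal to this space (in the orthogonal complement $\mathcal{W}^{\perp}$
of $\mathcal{W}^{\parallel}$). The inner product in (A.8) can be written as

(A.10)  $$2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b\rangle = 2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b^{\perp}\rangle + 2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b^{\parallel}\rangle.$$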
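
The split of a vector into a component inside the span of given vectors and a component orthogonal to it can be computed by least squares. The following sketch is only an illustration of that decomposition; the function name `split_parallel_orthogonal` and the use of `numpy.linalg.lstsq` are choices made here, with the columns of `basis` standing in for the spanning observation vectors.

```python
import numpy as np


def split_parallel_orthogonal(w, basis):
    """Split w into (w_perp, w_par) with w_par in span(columns of basis)."""
    # Least-squares coefficients give the orthogonal projection onto the span.
    coef, *_ = np.linalg.lstsq(basis, w, rcond=None)
    w_par = basis @ coef
    w_perp = w - w_par
    return w_perp, w_par
```

By linearity, the inner product of any vector with `w` then equals the sum of its inner products with `w_perp` and `w_par`, which is the identity used in (A.10).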
By Lemma A.3 (see below), there exists for every $g > 0$ some $c > 0$ so that,
for $n \to \infty$,

$$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b^{\parallel}\rangle/(1 + |\beta|)\} > -g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$

To show (A.8), it is sufficient to prove that there exists for every $g > 0$ some
$c > 0$ so that, for $n \to \infty$,

(A.11)  $$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b^{\perp}\rangle - g(1 + |\beta|)\lambda\} > g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$
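For some random variable $V_a$, independent of $X_{\mathrm{ne}_a}$, we have

$$X_a = \sum_{k \in \mathrm{ne}_a} \theta_k^a X_k + V_a.$$

Note that $V_a$ and $W_b$ are independent normally distributed random variables
with variances $\sigma_a^2$ and $\sigma_b^2$, respectively. By Assumption 2, $0 < v^2 \le \sigma_b^2, \sigma_a^2 \le 1$.
Note furthermore that $W_b$ and $X_{\mathrm{ne}_a\setminus\{b\}}$ are independent. Using $\theta^a = \theta^{a,\mathrm{ne}_a}$
and (A.6),

(A.12)  $$X_a = \sum_{k \in \mathrm{ne}_a\setminus\{b\}} \bigl(\theta_k^a + \theta_b^a\,\theta_k^{b,\mathrm{ne}_a\setminus\{b\}}\bigr) X_k + \theta_b^a W_b + V_a.$$

Using (A.12), the definition of the residuals in (A.5) and the orthogonality
property of $\mathbf{W}_b^{\perp}$,

(A.13)  $$\begin{aligned} 2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b^{\perp}\rangle &= 2n^{-1}(\theta_b^a - \beta)\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle + 2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp}\rangle \\ &\ge 2n^{-1}(\theta_b^a - \beta)\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle - |2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp}\rangle|. \end{aligned}$$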

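The second term, $|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp}\rangle|$, is stochastically smaller than $|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b\rangle|$
(this can be derived by conditioning on $\{\mathbf{X}_k;\ k \in \mathrm{ne}_a\}$). Due to independence
of $V_a$ and $W_b$, $E(V_a W_b) = 0$. Using Bernstein's inequality (Lemma 2.2.11 in
[17]), and $\lambda \sim dn^{-(1-\varepsilon)/2}$ with $\varepsilon > 0$, there exists for every $g > 0$ some $c > 0$
so that

(A.14)  $$P\bigl(|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp}\rangle| \ge g\lambda\bigr) \le P\bigl(|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b\rangle| \ge g\lambda\bigr) = O(\exp(-cn^{\varepsilon})).$$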

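Instead of (A.11), it is sufficient by (A.13) and (A.14) to show that there
exists for every $g > 0$ a $c > 0$ so that, for $n \to \infty$,

(A.15)  $$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}(\theta_b^a - \beta)\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle - g(1 + |\beta|)\lambda\} > 2g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$

Note that $\sigma_b^{-2}\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle$ follows a $\chi^2_{n-|\mathrm{ne}_a|}$ distribution. As $|\mathrm{ne}_a| = o(n)$ and
$\sigma_b^2 \ge v^2$ (by Assumption 2), it follows that there exists some $k > 0$ so that
for $n \ge n_0$ with some $n_0(k) \in \mathbb{N}$, and any $c > 0$,

$$P\bigl(2n^{-1}\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle > k\bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$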
To show (A.15), it hence suffices to prove that for every $k, \ell > 0$ there exists
some $n_0(k,\ell) \in \mathbb{N}$ so that, for all $n \ge n_0$,

(A.16)  $$\inf_{\beta \le 0}\{(\theta_b^a - \beta)k - \ell(1 + |\beta|)\lambda\} > 0.$$

By Assumption 5, $|\pi_{ab}|$ is of order at least $n^{-(1-\xi)/2}$. Using

$$\pi_{ab} = \theta_b^a / \bigl(\operatorname{Var}(X_a \mid X_{\Gamma(n)\setminus\{a\}}) \operatorname{Var}(X_b \mid X_{\Gamma(n)\setminus\{b\}})\bigr)^{1/2}$$

and Assumption 2, this implies that there exists some $q > 0$ so that $\theta_b^a \ge
qn^{-(1-\xi)/2}$. As $\lambda \sim dn^{-(1-\varepsilon)/2}$ and, by the assumptions in Theorem 1, $\xi > \varepsilon$,
it follows that for every $k, \ell > 0$ and large enough values of $n$,

$$\theta_b^a k - \ell\lambda > 0.$$

It remains to show that for any $k, \ell > 0$ there exists some $n_0(k,\ell)$ so that for
all $n \ge n_0$,

$$\inf_{\beta \le 0}\{-\beta k - \ell|\beta|\lambda\} \ge 0.$$

This follows as $\lambda \to 0$ for $n \to \infty$, which completes the proof. □

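Lemma A.3. Assume the conditions of Theorem 1 hold. Let $\mathbf{R}_a^\lambda(\beta)$ be
defined as in (A.5) and $\mathbf{W}_b^{\parallel}$ as in (A.9). For any $g > 0$ there exists $c > 0$ so
that for all $a, b \in \Gamma(n)$, for $n \to \infty$,

$$P\Bigl(\sup_{\beta \in \mathbb{R}} |2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b^{\parallel}\rangle|/(1 + |\beta|) < g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$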

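Proof. By Schwarz's inequality,

(A.17)  $$|2n^{-1}\langle \mathbf{R}_a^\lambda(\beta), \mathbf{W}_b^{\parallel}\rangle|/(1 + |\beta|) \le 2n^{-1/2}\|\mathbf{W}_b^{\parallel}\|_2 \, \frac{n^{-1/2}\|\mathbf{R}_a^\lambda(\beta)\|_2}{1 + |\beta|}.$$

The sum of squares of the residuals is increasing with increasing value of $\lambda$.
Thus, $\|\mathbf{R}_a^\lambda(\beta)\|_2^2 \le \|\mathbf{R}_a^\infty(\beta)\|_2^2$. By definition of $\mathbf{R}_a^\lambda$ in (A.5), and using (A.3),

$$\|\mathbf{R}_a^\infty(\beta)\|_2^2 = \|\mathbf{X}_a - \beta\mathbf{X}_b\|_2^2,$$

and hence

$$\|\mathbf{R}_a^\lambda(\beta)\|_2^2 \le (1 + |\beta|)^2 \max\{\|\mathbf{X}_a\|_2^2, \|\mathbf{X}_b\|_2^2\}.$$

Therefore, for any $q > 0$,

$$P\Bigl(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|\mathbf{R}_a^\lambda(\beta)\|_2}{1 + |\beta|} > q\Bigr) \le P\bigl(n^{-1/2}\max\{\|\mathbf{X}_a\|_2, \|\mathbf{X}_b\|_2\} > q\bigr).$$

Note that both $\|\mathbf{X}_a\|_2^2$ and $\|\mathbf{X}_b\|_2^2$ are $\chi^2_n$-distributed. Thus there exist $q > 1$
and $c > 0$ so that

(A.18)  $$P\Bigl(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|\mathbf{R}_a^\lambda(\beta)\|_2}{1 + |\beta|} > q\Bigr) = O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty.$$

It remains to show that for every $g > 0$ there exists some $c > 0$ so that

(A.19)  $$P\bigl(n^{-1/2}\|\mathbf{W}_b^{\parallel}\|_2 > g\lambda\bigr) = O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty.$$

The expression $\sigma_b^{-2}\langle \mathbf{W}_b^{\parallel}, \mathbf{W}_b^{\parallel}\rangle$ is $\chi^2_{|\mathrm{ne}_a|-1}$-distributed. As $\sigma_b \le 1$ and $|\mathrm{ne}_a| =
O(n^{\kappa})$, it follows that $n^{-1/2}\|\mathbf{W}_b^{\parallel}\|_2$ is for some $t > 0$ stochastically smaller
than

$$tn^{-(1-\kappa)/2}(Z/n^{\kappa})^{1/2},$$

where $Z$ is $\chi^2_{n^{\kappa}}$-distributed. Thus, for every $g > 0$,

$$P\bigl(n^{-1/2}\|\mathbf{W}_b^{\parallel}\|_2 > g\lambda\bigr) \le P\bigl((Z/n^{\kappa}) > (g/t)^2 n^{1-\kappa}\lambda^2\bigr).$$

As $\lambda^{-1} = O(n^{(1-\varepsilon)/2})$, it follows that $n^{1-\kappa}\lambda^2 \ge hn^{\varepsilon-\kappa}$ for some $h > 0$ and
sufficiently large $n$. By the properties of the $\chi^2$ distribution and $\varepsilon > \kappa$, by
assumption in Theorem 1, claim (A.19) follows. This completes the proof.
□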

Proof of Proposition 1. All diagonal elements of the covariance
matrices $\Sigma(n)$ are equal to 1, while all off-diagonal elements vanish for
all pairs except for $a, b \in \Gamma(n)$, where $\Sigma_{ab}(n) = s$ with $0 < s < 1$. Assume
w.l.o.g. that $a$ corresponds to the first and $b$ to the second variable. The
best vector of coefficients $\theta^a$ for linear prediction of $X_a$ is given by $\theta^a =
(0, -K_{ab}/K_{aa}, 0, 0, \ldots) = (0, s, 0, 0, \ldots)$, where $K = \Sigma^{-1}(n)$. A necessary
condition for $\hat{\mathrm{ne}}_a^{\lambda_{\mathrm{oracle}}} = \mathrm{ne}_a$ is that $\hat\theta^{a,\lambda} = (0, \tau, 0, 0, \ldots)$ is the oracle Lasso solution
for some $\tau \neq 0$. In the following, we show first that

(A.20)  $$P\bigl(\exists \lambda, \tau \ge s : \hat\theta^{a,\lambda} = (0, \tau, 0, 0, \ldots)\bigr) \to 0, \quad n \to \infty.$$

The proof is then completed by showing in addition that $(0, \tau, 0, 0, \ldots)$ cannot
be the oracle Lasso solution as long as $\tau < s$.

We begin by showing (A.20). If $\hat\theta = (0, \tau, 0, 0, \ldots)$ is a Lasso solution for
some value of the penalty, it follows that, using Lemma A.1 and positivity
of $\tau$,

(A.21)  $$\langle \mathbf{X}_1 - \tau\mathbf{X}_2, \mathbf{X}_2\rangle \ge |\langle \mathbf{X}_1 - \tau\mathbf{X}_2, \mathbf{X}_k\rangle| \quad \forall k \in \Gamma(n),\ k > 2.$$

Under the given assumptions, $X_2, X_3, \ldots$ can be understood to be independently
and identically distributed, while $X_1 = sX_2 + W_1$, with $W_1$ independent
of $(X_2, X_3, \ldots)$. Substituting $X_1 = sX_2 + W_1$ in (A.21) yields for all
$k \in \Gamma(n)$ with $k > 2$,

$$\langle \mathbf{W}_1, \mathbf{X}_2\rangle - (\tau - s)\langle \mathbf{X}_2, \mathbf{X}_2\rangle \ge |\langle \mathbf{W}_1, \mathbf{X}_k\rangle - (\tau - s)\langle \mathbf{X}_2, \mathbf{X}_k\rangle|.$$

Let $U_2, U_3, \ldots, U_{p(n)}$ be the random variables defined by $U_k = \langle \mathbf{W}_1, \mathbf{X}_k\rangle$.
Note that the random variables $U_k$, $k = 2, \ldots, p(n)$, are exchangeable. Let
furthermore

$$D = \langle \mathbf{X}_2, \mathbf{X}_2\rangle - \max_{k \in \Gamma(n),\,k > 2} |\langle \mathbf{X}_2, \mathbf{X}_k\rangle|.$$

The inequality above implies then

$$U_2 > \max_{k \in \Gamma(n),\,k > 2} U_k + (\tau - s)D.$$

To show the claim, it thus suffices to show that

(A.22)  $$P\Bigl(U_2 > \max_{k \in \Gamma(n),\,k > 2} U_k + (\tau - s)D\Bigr) \to 0 \quad \text{for } n \to \infty.$$

Using $\tau - s \ge 0$,

$$P\Bigl(U_2 > \max_{k \in \Gamma(n),\,k > 2} U_k + (\tau - s)D\Bigr) \le P\Bigl(U_2 > \max_{k \in \Gamma(n),\,k > 2} U_k\Bigr) + P(D < 0).$$

Using the assumption that $s < 1$, it follows by $p(n) = o(n^{\gamma})$ for some $\gamma > 0$
and a Bernstein-type inequality that

$$P(D < 0) \to 0 \quad \text{for } n \to \infty.$$

Furthermore, as $U_2, \ldots, U_{p(n)}$ are exchangeable,

$$P\Bigl(U_2 > \max_{k \in \Gamma(n),\,k > 2} U_k\Bigr) = (p(n) - 1)^{-1} \to 0 \quad \text{for } n \to \infty,$$

which shows that (A.22) holds. The claim (A.20) follows.

It hence suffices to show that $(0, \tau, 0, 0, \ldots)$ with $\tau < s$ cannot be the oracle
Lasso solution. Let $\tau_{\max}$ be the maximal value of $\tau$ so that $(0, \tau, 0, \ldots)$ is a
Lasso solution for some value $\lambda > 0$. By the previous assumption, $\tau_{\max} < s$.
For $\tau < \tau_{\max}$, the vector $(0, \tau, 0, \ldots)$ cannot be the oracle Lasso solution. We
show in the following that $(0, \tau_{\max}, 0, \ldots)$ cannot be an oracle Lasso solution
either. Suppose that $(0, \tau_{\max}, 0, 0, \ldots)$ is the Lasso solution $\hat\theta^{a,\lambda}$ for some
$\lambda = \tilde\lambda > 0$. As $\tau_{\max}$ is the maximal value such that $(0, \tau, 0, \ldots)$ is a Lasso
solution, there exists some $k \in \Gamma(n)$, $k > 2$, such that

$$|n^{-1}\langle \mathbf{X}_1 - \tau_{\max}\mathbf{X}_2, \mathbf{X}_2\rangle| = |n^{-1}\langle \mathbf{X}_1 - \tau_{\max}\mathbf{X}_2, \mathbf{X}_k\rangle|,$$

and the value of both components $G_2$ and $G_k$ of the gradient is equal to
$\tilde\lambda$. By appropriately reordering the variables we can assume that $k = 3$.
Furthermore, it holds a.s. that

$$\max_{k \in \Gamma(n),\,k > 3} |\langle \mathbf{X}_1 - \tau_{\max}\mathbf{X}_2, \mathbf{X}_k\rangle| < \tilde\lambda.$$

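Hence, for sufficiently small $\delta\lambda \ge 0$, a Lasso solution for the penalty $\tilde\lambda - \delta\lambda$
is given by

$$(0, \tau_{\max} + \delta\theta_2, \delta\theta_3, 0, \ldots).$$

Let $H_n$ be the empirical covariance matrix of $(X_2, X_3)$. Assume w.l.o.g.
that $n^{-1}\langle \mathbf{X}_1 - \tau_{\max}\mathbf{X}_2, \mathbf{X}_k\rangle > 0$ and $n^{-1}\langle \mathbf{X}_2, \mathbf{X}_2\rangle = n^{-1}\langle \mathbf{X}_3, \mathbf{X}_3\rangle = 1$. Following,
for example, Efron et al. ([6], page 417), the components $(\delta\theta_2, \delta\theta_3)$
are then given by $H_n^{-1}(1, 1)^T$, from which it follows that $\delta\theta_2 = \delta\theta_3$, which
we abbreviate by $\delta\theta$ in the following (one can accommodate a negative sign
for $n^{-1}\langle \mathbf{X}_1 - \tau_{\max}\mathbf{X}_2, \mathbf{X}_k\rangle$ by reversing the sign of $\delta\theta_3$). Denote by $L_\delta$ the
squared error loss for this solution. Then, for sufficiently small $\delta\theta$,

$$\begin{aligned} L_\delta - L_0 &= E(X_1 - (\tau_{\max} + \delta\theta)X_2 - \delta\theta X_3)^2 - E(X_1 - \tau_{\max}X_2)^2 \\ &= (s - (\tau_{\max} + \delta\theta))^2 + \delta\theta^2 - (s - \tau_{\max})^2 \\ &= -2(s - \tau_{\max})\delta\theta + 2\delta\theta^2. \end{aligned}$$

It holds that $L_\delta - L_0 < 0$ for any $0 < \delta\theta < \tfrac{1}{2}(s - \tau_{\max})$, which shows that
$(0, \tau, 0, \ldots)$ cannot be the oracle solution for $\tau < s$. Together with (A.20),
this completes the proof. □

Proof of Proposition 2. The subdifferential of the argument in (6),

$$E\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a X_m\Bigr)^2 + \eta\|\theta^a\|_1,$$

with respect to $\theta_k^a$, $k \in \Gamma(n)\setminus\{a\}$, is given by

$$-2E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a X_m\Bigr)X_k\Bigr) + \eta e_k,$$

where $e_k \in [-1,1]$ if $\theta_k^a = 0$, and $e_k = \operatorname{sign}(\theta_k^a)$ if $\theta_k^a \neq 0$. Using the fact that
$\mathrm{ne}_a(\eta) = \mathrm{ne}_a$, it follows as in Lemma A.1 that for all $k \in \mathrm{ne}_a$,

(A.23)  $$2E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a(\eta) X_m\Bigr)X_k\Bigr) = \eta\operatorname{sign}(\theta_k^a),$$

and, for $b \notin \mathrm{cl}_a$,

(A.24)  $$\Bigl|2E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a(\eta) X_m\Bigr)X_b\Bigr)\Bigr| \le \eta.$$

A variable $X_b$ with $b \notin \mathrm{cl}_a$ can be written as

$$X_b = \sum_{k \in \mathrm{ne}_a} \theta_k^{b,\mathrm{ne}_a} X_k + W_b,$$

where $W_b$ is independent of $\{X_k;\ k \in \mathrm{cl}_a\}$. Using this in (A.24) yields

$$\Bigl|2\sum_{k \in \mathrm{ne}_a} \theta_k^{b,\mathrm{ne}_a} E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a(\eta) X_m\Bigr)X_k\Bigr)\Bigr| \le \eta.$$

Using (A.23) and $\theta^a = \theta^{a,\mathrm{ne}_a}$, it follows that

$$\Bigl|\sum_{k \in \mathrm{ne}_a} \theta_k^{b,\mathrm{ne}_a}\operatorname{sign}(\theta_k^{a,\mathrm{ne}_a})\Bigr| \le 1,$$

which completes the proof. □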



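Proof of Theorem 1. The event $\hat{\mathrm{ne}}_a^\lambda \not\subseteq \mathrm{ne}_a$ is equivalent to the event
that there exists some node $b \in \Gamma(n)\setminus\mathrm{cl}_a$ in the set of nonneighbors of node
$a$ such that the estimated coefficient $\hat\theta_b^{a,\lambda}$ is not zero. Thus

(A.25)  $$P(\hat{\mathrm{ne}}_a^\lambda \subseteq \mathrm{ne}_a) = 1 - P(\exists b \in \Gamma(n)\setminus\mathrm{cl}_a : \hat\theta_b^{a,\lambda} \neq 0).$$

Consider the Lasso estimate $\hat\theta^{a,\mathrm{ne}_a,\lambda}$, which is by (A.1) constrained to
have nonzero components only in the neighborhood of node $a \in \Gamma(n)$. Using
$|\mathrm{ne}_a| = O(n^{\kappa})$ with some $\kappa < 1$, we can assume w.l.o.g. that $|\mathrm{ne}_a| \le n$. This
in turn implies (see, e.g., [13]) that $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ is a.s. a unique solution to (A.1)
with $A = \mathrm{ne}_a$. Let $\mathcal{E}$ be the event

$$\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda.$$

Conditional on the event $\mathcal{E}$, it follows from the first part of Lemma A.1 that
$\hat\theta^{a,\mathrm{ne}_a,\lambda}$ is not only a solution of (A.1), with $A = \mathrm{ne}_a$, but as well a solution
of (3), where $A = \Gamma(n)\setminus\{a\}$. As $\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0$ for all $b \in \Gamma(n)\setminus\mathrm{cl}_a$, it follows
from the second part of Lemma A.1 that $\hat\theta_b^{a,\lambda} = 0$, $\forall b \in \Gamma(n)\setminus\mathrm{cl}_a$. Hence

$$P(\exists b \in \Gamma(n)\setminus\mathrm{cl}_a : \hat\theta_b^{a,\lambda} \neq 0) \le 1 - P(\mathcal{E}) = P\Bigl(\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \lambda\Bigr),$$

where

(A.26)  $$G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{X}_b\rangle.$$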

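Using Bonferroni's inequality and $p(n) = O(n^{\gamma})$ for any $\gamma > 0$, it suffices to
show that there exists a constant $c > 0$ so that for all $b \in \Gamma(n)\setminus\mathrm{cl}_a$,

(A.27)  $$P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \lambda\bigr) = O(\exp(-cn^{\varepsilon})).$$

One can write for any $b \in \Gamma(n)\setminus\mathrm{cl}_a$,

(A.28)  $$X_b = \sum_{m \in \mathrm{ne}_a} \theta_m^{b,\mathrm{ne}_a} X_m + V_b,$$

where $V_b \sim \mathcal{N}(0, \sigma_b^2)$ for some $\sigma_b^2 \le 1$ and $V_b$ is independent of $\{X_m;\ m \in \mathrm{cl}_a\}$.
Hence

$$G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta_m^{b,\mathrm{ne}_a}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{X}_m\rangle - 2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{V}_b\rangle.$$

By Lemma A.2, there exists some $c > 0$ so that with probability $1 -
O(\exp(-cn^{\varepsilon}))$,

(A.29)  $$\operatorname{sign}(\hat\theta_k^{a,\mathrm{ne}_a,\lambda}) = \operatorname{sign}(\theta_k^{a,\mathrm{ne}_a}) \quad \forall k \in \mathrm{ne}_a.$$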



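In this case by Lemma A.1

$$2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta_m^{b,\mathrm{ne}_a}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{X}_m\rangle = \Bigl(\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\theta_m^{a,\mathrm{ne}_a})\,\theta_m^{b,\mathrm{ne}_a}\Bigr)\lambda.$$

If (A.29) holds, the gradient is given by

(A.30)  $$G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -\Bigl(\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\theta_m^{a,\mathrm{ne}_a})\,\theta_m^{b,\mathrm{ne}_a}\Bigr)\lambda - 2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{V}_b\rangle.$$

Using Assumption 6 and Proposition 2, there exists some $\delta < 1$ so that

$$\Bigl|\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\theta_m^{a,\mathrm{ne}_a})\,\theta_m^{b,\mathrm{ne}_a}\Bigr| \le \delta.$$

The absolute value of the coefficient $G_b$ of the gradient in (A.26) is hence
bounded with probability $1 - O(\exp(-cn^{\varepsilon}))$ by

(A.31)  $$|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \le \delta\lambda + |2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{V}_b\rangle|.$$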

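Conditional on $\mathbf{X}_{\mathrm{cl}_a} = \{\mathbf{X}_k;\ k \in \mathrm{cl}_a\}$, the random variable

$$\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{V}_b\rangle$$

is normally distributed with mean zero and variance $\sigma_b^2\|\mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}\|_2^2$.
On the one hand, $\sigma_b^2 \le 1$. On the other hand, by definition of $\hat\theta^{a,\mathrm{ne}_a,\lambda}$,

$$\|\mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}\|_2 \le \|\mathbf{X}_a\|_2.$$

Thus

$$|2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{V}_b\rangle|$$

is stochastically smaller than or equal to $|2n^{-1}\langle \mathbf{X}_a, \mathbf{V}_b\rangle|$. Using (A.31), it
remains to be shown that for some $c > 0$ and $\delta < 1$,

$$P\bigl(|2n^{-1}\langle \mathbf{X}_a, \mathbf{V}_b\rangle| \ge (1 - \delta)\lambda\bigr) = O(\exp(-cn^{\varepsilon})).$$

As $V_b$ and $X_a$ are independent, $E(X_a V_b) = 0$. Using the Gaussianity and
bounded variance of both $X_a$ and $V_b$, there exists some $g < \infty$ so that
$E(\exp(|X_a V_b|)) \le g$. Hence, using Bernstein's inequality and the boundedness
of $\lambda$, for some $c > 0$, for all $b \in \Gamma(n)\setminus\mathrm{cl}_a$, $P(|2n^{-1}\langle \mathbf{X}_a, \mathbf{V}_b\rangle| \ge (1 - \delta)\lambda) =
O(\exp(-cn\lambda^2))$. The claim (A.27) follows, which completes the proof. □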

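Proof of Proposition 3. Following a similar argument as in Theorem 1
up to (A.27), it is sufficient to show that for every $a, b$ with $b \in \Gamma(n)\setminus\mathrm{cl}_a$
and $|S_a(b)| > 1$,

(A.32)  $$P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| > \lambda\bigr) \to 1 \quad \text{for } n \to \infty.$$

Using (A.30) in the proof of Theorem 1, one can conclude that for some
$\delta > 1$, with probability converging to 1 for $n \to \infty$,

(A.33)  $$|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \delta\lambda - |2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{V}_b\rangle|.$$

Using the identical argument as in the proof of Theorem 1 below (A.31), for
the second term, for any $g > 0$,

$$P\bigl(|2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{V}_b\rangle| > g\lambda\bigr) \to 0 \quad \text{for } n \to \infty,$$

which together with $\delta > 1$ in (A.33) shows that (A.32) holds. This completes
the proof. □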

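Proof of Theorem 2. First, $P(\mathrm{ne}_a \subseteq \hat{\mathrm{ne}}_a^\lambda) = 1 - P(\exists b \in \mathrm{ne}_a : \hat\theta_b^{a,\lambda} = 0)$.
Let $\mathcal{E}$ be again the event

(A.34)  $$\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda.$$

Conditional on $\mathcal{E}$, we can conclude as in the proof of Theorem 1 that $\hat\theta^{a,\mathrm{ne}_a,\lambda}$
and $\hat\theta^{a,\lambda}$ are unique solutions to (A.1) and (3), respectively, and $\hat\theta^{a,\mathrm{ne}_a,\lambda} =
\hat\theta^{a,\lambda}$. Thus

$$P(\exists b \in \mathrm{ne}_a : \hat\theta_b^{a,\lambda} = 0) \le P(\exists b \in \mathrm{ne}_a : \hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0) + P(\mathcal{E}^c).$$

It follows from the proof of Theorem 1 that there exists some $c > 0$ so that
$P(\mathcal{E}^c) = O(\exp(-cn^{\varepsilon}))$. Using Bonferroni's inequality, it hence remains to
show that there exists some $c > 0$ so that for all $b \in \mathrm{ne}_a$,

(A.35)  $$P(\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0) = O(\exp(-cn^{\varepsilon})).$$

This follows from Lemma A.2, which completes the proof. □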

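Proof of Proposition 4. The proof of Proposition 4 is to a large
extent analogous to the proofs of Theorems 1 and 2. Let $\mathcal{E}$ be again the
event (A.34). Conditional on the event $\mathcal{E}$, we can conclude as before that
$\hat\theta^{a,\mathrm{ne}_a,\lambda}$ and $\hat\theta^{a,\lambda}$ are unique solutions to (A.1) and (3), respectively, and
$\hat\theta^{a,\mathrm{ne}_a,\lambda} = \hat\theta^{a,\lambda}$. Hence, for any $b \in \mathrm{ne}_a$,

$$P(b \notin \hat{\mathrm{ne}}_a^\lambda) = P(\hat\theta_b^{a,\lambda} = 0) \ge P(\hat\theta_b^{a,\lambda} = 0 \mid \mathcal{E})P(\mathcal{E}).$$

Since $P(\mathcal{E}) \to 1$ for $n \to \infty$ by Theorem 1,

$$P(\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0 \mid \mathcal{E})P(\mathcal{E}) \to P(\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0) \quad \text{for } n \to \infty.$$

It thus suffices to show that for all $b \in \mathrm{ne}_a$ with $|\pi_{ab}| = O(n^{-(1-\xi)/2})$ and
$\xi < \varepsilon$,

$$P(\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0) \to 1 \quad \text{for } n \to \infty.$$

This holds if

(A.36)  $$P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a\setminus\{b\},\lambda})| < \lambda\bigr) \to 1 \quad \text{for } n \to \infty,$$

as $|G_b(\hat\theta^{a,\mathrm{ne}_a\setminus\{b\},\lambda})| < \lambda$ implies that $\hat\theta^{a,\mathrm{ne}_a\setminus\{b\},\lambda} = \hat\theta^{a,\mathrm{ne}_a,\lambda}$ and hence
$\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0$. Using (A.7),

$$|G_b(\hat\theta^{a,\mathrm{ne}_a\setminus\{b\},\lambda})| \le |2n^{-1}\langle \mathbf{R}_a^\lambda(0), \mathbf{W}_b\rangle| + \lambda\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1.$$

By assumption $\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1 < 1$. It is thus sufficient to show that for any
$g > 0$,

(A.37)  $$P\bigl(|2n^{-1}\langle \mathbf{R}_a^\lambda(0), \mathbf{W}_b\rangle| < g\lambda\bigr) \to 1 \quad \text{for } n \to \infty.$$

Analogously to (A.10), we can write

(A.38)  $$|2n^{-1}\langle \mathbf{R}_a^\lambda(0), \mathbf{W}_b\rangle| \le |2n^{-1}\langle \mathbf{R}_a^\lambda(0), \mathbf{W}_b^{\perp}\rangle| + |2n^{-1}\langle \mathbf{R}_a^\lambda(0), \mathbf{W}_b^{\parallel}\rangle|.$$

Using Lemma A.3, it follows for the last term on the right-hand side that
for every $g > 0$,

$$P\bigl(|2n^{-1}\langle \mathbf{R}_a^\lambda(0), \mathbf{W}_b^{\parallel}\rangle| < g\lambda\bigr) \to 1 \quad \text{for } n \to \infty.$$

Using (A.13) and (A.14), it hence remains to show that for every $g > 0$,

(A.39)  $$P\bigl(|2n^{-1}\theta_b^{a,\mathrm{ne}_a}\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle| < g\lambda\bigr) \to 1 \quad \text{for } n \to \infty.$$

We have already noted above that the term $\sigma_b^{-2}\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle$ follows a $\chi^2_{n-|\mathrm{ne}_a|}$
distribution and is hence stochastically smaller than a $\chi^2_n$-distributed random
variable. By Assumption 2, $\sigma_b^2 \le 1$. Furthermore, using Assumption 2
and $|\pi_{ab}| = O(n^{-(1-\xi)/2})$, $|\theta_b^{a,\mathrm{ne}_a}| = O(n^{-(1-\xi)/2})$. Hence, with $\lambda \sim dn^{-(1-\varepsilon)/2}$,
it follows that for some constant $k > 0$, $\lambda/|\theta_b^{a,\mathrm{ne}_a}| \ge kn^{(\varepsilon-\xi)/2}$. Thus, for some
constant $c > 0$,

(A.40)  $$P\bigl(|2n^{-1}\theta_b^{a,\mathrm{ne}_a}\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp}\rangle| < g\lambda\bigr) \ge P\bigl(Z/n < cn^{(\varepsilon-\xi)/2}\bigr),$$

where $Z$ follows a $\chi^2_n$ distribution. By the properties of the $\chi^2$ distribution
and the assumption $\xi < \varepsilon$, the right-hand side in (A.40) converges to 1 for
$n \to \infty$, from which (A.39) and hence the claim follow. □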

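Proof of Theorem 3. A necessary condition for $\hat{C}_a^\lambda \not\subseteq C_a$ is that there
exists an edge in $\hat{E}^\lambda$ joining two nodes in two different connectivity components.
Hence

$$P(\exists a \in \Gamma(n) : \hat{C}_a^\lambda \not\subseteq C_a) \le p(n)\max_{a \in \Gamma(n)} P(\exists b \in \Gamma(n)\setminus C_a : b \in \hat{\mathrm{ne}}_a^\lambda).$$

Using the same arguments as in the proof of Theorem 1,

$$P(\exists b \in \Gamma(n)\setminus C_a : b \in \hat{\mathrm{ne}}_a^\lambda) \le P\Bigl(\max_{b \in \Gamma(n)\setminus C_a} |G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda\Bigr),$$

where $\hat\theta^{a,C_a,\lambda}$, according to (A.1), has nonzero components only for variables
in the connectivity component $C_a$ of node $a$. Hence it is sufficient to show
that

(A.41)  $$p(n)^2 \max_{a \in \Gamma(n),\,b \in \Gamma(n)\setminus C_a} P\bigl(|G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda\bigr) \le \alpha.$$

The gradient is given by $G_b(\hat\theta^{a,C_a,\lambda}) = -2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,C_a,\lambda}, \mathbf{X}_b\rangle$. For all
$k \in C_a$ the variables $X_b$ and $X_k$ are independent as they are in different
connectivity components. Hence, conditional on $\mathbf{X}_{C_a} = \{\mathbf{X}_k;\ k \in C_a\}$,

$$G_b(\hat\theta^{a,C_a,\lambda}) \sim \mathcal{N}(0, R^2/n),$$

where $R^2 = 4n^{-1}\|\mathbf{X}_a - \mathbf{X}\hat\theta^{a,C_a,\lambda}\|_2^2$, which is smaller than or equal to $4\hat\sigma_a^2 =
4n^{-1}\|\mathbf{X}_a\|_2^2$ by definition of $\hat\theta^{a,C_a,\lambda}$. Hence for all $a \in \Gamma(n)$ and $b \in \Gamma(n)\setminus C_a$,

$$P\bigl(|G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda \mid \mathbf{X}_{C_a}\bigr) \le 2\tilde\Phi\bigl(\sqrt{n}\,\lambda/(2\hat\sigma_a)\bigr),$$

where $\tilde\Phi = 1 - \Phi$. It follows for the $\lambda$ proposed in (9) that $P(|G_b(\hat\theta^{a,C_a,\lambda})| \ge
\lambda \mid \mathbf{X}_{C_a}) \le \alpha p(n)^{-2}$, and therefore $P(|G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda) \le \alpha p(n)^{-2}$. Thus (A.41)
follows, which completes the proof. □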
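
The last step of the proof can be checked numerically. Equation (9) is not reproduced in this part of the text, so the sketch below assumes the penalty obtained by solving $2\tilde\Phi(\sqrt{n}\,\lambda/(2\hat\sigma_a)) = \alpha p(n)^{-2}$ for $\lambda$, that is $\lambda = (2\hat\sigma_a/\sqrt{n})\,\tilde\Phi^{-1}(\alpha/(2p(n)^2))$. The function names are hypothetical, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def penalty_for_level(alpha, n, p, sigma_hat):
    """Penalty lam solving 2 * (1 - Phi(sqrt(n) * lam / (2 * sigma_hat))) = alpha / p**2."""
    q = alpha / (2.0 * p ** 2)
    # Upper-tail quantile: (1 - Phi)^{-1}(q) = -Phi^{-1}(q) by symmetry.
    upper_quantile = -NormalDist().inv_cdf(q)
    return 2.0 * sigma_hat / sqrt(n) * upper_quantile


def tail_bound(lam, n, sigma_hat):
    """The bound 2 * (1 - Phi(sqrt(n) * lam / (2 * sigma_hat))) from the proof."""
    # 1 - Phi(x) is evaluated as Phi(-x) to avoid cancellation in the far tail.
    return 2.0 * NormalDist().cdf(-sqrt(n) * lam / (2.0 * sigma_hat))
```

With $n = 600$, $p = 1000$, $\alpha = 0.05$ and unit $\hat\sigma_a$, for instance, `tail_bound(penalty_for_level(0.05, 600, 1000, 1.0), 600, 1.0)` should reproduce $\alpha p^{-2}$ up to floating-point error, and summing over the at most $p^2$ ordered pairs $(a, b)$ gives the level $\alpha$ in (A.41).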